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0> . Abstract 



We solve a fractional diffusion equation using a piecewise-constant, 
discontinuous Galerkin method in time combined with a continuous, 
piecewise-linear finite element method in space. If there are A^ time 
levels and M spatial degrees of freedom, then a direct implementation 
'3 ' of this method requires 0{N'^M) operations and 0{NM) storage, ow- 

C^ \ ing to the presence of a memory term: at each time step, the discrete 

evolution equation involves a sum over all previous time levels. We 
show how the computational cost can be reduced to 0(MA^log A^) op- 
erations and 0(M log A^) active memory locations. 
> 
(N 

O ; 1 Introduction 

^, 

rn . The density u = u{x,t) of particles undergoing anomalous subdiffusion sat- 

isfies the integro differential equation [7] 



t<^' *>-''■ (I /'^^'^'^"'^■^>'^'=^<^'*' 



c^ , for a parameter z/ in the range < z/ < 1, where K > is a generalized 

diffusivity and / is a homogeneous term. We consider ([T]) for < t < T and 
for X in a bounded, convex or C^ domain Q C M'^, subject to homogeneous 
boundary conditions, either of Dirichlet type. 



or of Neumann type. 



u{x,t) = ioTxedn, (2) 



— {x,t) = ioTxedfl, (3) 
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where n denotes the outward unit normal for Q. In addition, we specify the 
initial condition 

u{x, 0) = Uo{x) for X E ft. 

In the limit as z/ — > 1, the evolution equation ([T]) reduces to the classical 
diffusion equation, m^ — V ■ (KVu) = /, in which the flux —KVu depends 
only on the instantaneous value of the gradient. By contrast, in ([T]) the flux 
depends on the entire history of the gradient, and this fact leads to significant 
computational challenges, particularly if the spatial dimension d = 3. 

In [5], we solved the foregoing initial-boundary value problem using a 
piecewise-constant, discontinuous Galerkin (DG) method for the time dis- 
cretization, combined with a standard, continuous piecewise-linear finite el- 
ement discretization in space. For simplicity, in the present work we assume 
that both the spatial mesh and the time steps are quasiuniform. Defining 
the elliptic partial differential operator Au = —V ■ {KVu), we assume that 
for some a G (0, 1] the solution u satisfies regularity estimates of the form [1] 

t''\\Au'{t)\\+f'+^\\Au"{t)\\<Cr-^ and \\Au{t)\\ +t\\Au'{t)\\ < C (4) 

for < t < T. The DG solution U{x, t) therefore satisfies an error bound [5], 
Theorem 3] in the norm of L2{fi), 

II [/" - u{t„) II < C{k'' + h"^) lg(t„/ti) for 1 < m < M and 1 < n < iV, (5) 

where t„ is the nth time level, U"'{x) = U{x,t^), k is the maximum time 
step, h is the maximum diameter of the spatial finite elements and lg(s) = 
max(l, I logs|). When cr < 1, we could achieve full first-order accuracy with 
respect to k (ignoring logarithmic factors) by relaxing the assumption that 
the time steps are quasi-uniform, but to do so would complicate our fast 
solution procedure. 

Section [2] provides a precise description of the DG method, which can be 
interpreted as a type of implicit Euler scheme. Denote the mth free node 
by Xm, and let U^ = U{xm, t~). At the nth time step, we compute the vector 
of nodal values, C/" = [f/^] G M*^, by solving a linear system 

n-l 

(M + /3„„5)C/" = MC/"-i + fc„r + SY,Pn,U^- (6) 

i=i 

Here, M and S are the mass and stiffness matrices arising from the spatial 
discretization, fc„ = t„ — tn-i is the length of the rath time interval, /" is the 
average value of the load vector for t„_i < t < t„, and the weights are given 

by 



"»"=' .^%^^^ = f(rt^ w 



The condition number of the matrix M+PnnS is 0{l+k'^h ^), and we assume 
the use of an efficient elhptic solver costing 0{M) operations. By compari- 
son, computing the right-hand side in the obvious way requires 0{nM) op- 
erations. Moreover, we must keep the vector U^ in active memory for all the 
previous time levels j = 1, 2, ..., n — 1, requiring 0{nM) locations. Thus, 
the total cost for A^ time steps is 0{N'^M) operations and 0{NM) active 
memory locations, whereas applying the same DG method to a classical dif- 
fusion equation (which has no memory term) costs only 0{NM) operations 
and 0{M) active locations. In other words, solving the fractional diffusion 
equation in this way costs A^ times as much as solving a classical diffusion 
equation. 

Cuesta, Lubich and Palencia [1] studied the time discretization of ([1]) 
by convolution quadrature, and Schadle, Lopez-Fernandez and Lubich [SI [2] 
developed a fast solution algorithm costing O(MA^logA^) operations and 
using 0(M log A^) active memory locations. The purpose of this paper is 
to present a fast summation algorithm for the DG method ([6]) that likewise 
costs O(MA^logA^) operations and 0(M log iV) active memory locations. 

The algorithm is closely related to the panel clustering technique for 
boundary element methods, introduced by Hackbusch and Nowak [3J. To 
explain the basic strategy, suppose that instead of (t — s)'^~^/r(z/) the inte- 
gral term had a degenerate kernel ^p=i a"p(t)bp{s) . In this case, 

r 

^"i = XI *^P"^Pi for 1 < j < n < A^, (9) 

p=i 

where (j)np = o,p{tn) — ctpitn-i) and ipnj = J^' bp{s) ds. Since 

n—l r n—1 

^/?.,t/^ = ^0,„vl/«-i(C/) where ^;-\U) = Y,^p,U^ , 
j=i p=i j=i 

and since '^p{U) = \E'p"^(C/) + ippnU^, evaluating the right-hand side of 
the linear system ([6]) would cost only 0{rM) operations and there would be 
no need to retain in active memory the solution at all previous time levels. 
Our kernel (t — sY~^ /T{v) is not degenerate, but it can be approximated 
to high accuracy by a degenerate kernel if we restrict t and s to suitable, 
well-separated intervals. Consequently, if t„ and tj are restricted in the same 
way, then (3nj can be approximated to high accuracy by a sum (3nj of the 



form on]), leading to a fast method to evaluate the sum that occurs on the 
right-hand side of (jH]). 

In Section [2] we investigate the effect of perturbing the DG method in this 
way, and Section [3] presents a simple scheme for generating the Pnj via Taylor 
expansion. Section H] describes the cluster tree, whose nodes are contiguous 
families of time intervals, used to appropriately restrict t„ and tj. The fast 
summation algorithm is then defined via the concept of an admissible cover- 
ing, and we present an error estimate in Theorem I4.3[ Further investigation 
of the cluster tree in Section [5] allows us to prove, in Theorem 15.31 that 
the algorithm requires O(A^MlogA^) operations. In Section [6|, we present 
a memory management strategy and show, in Theorem 16.11 that with this 
strategy the fast summation algorithm uses at most 0(M log A^) active mem- 
ory locations during each time step. Section [7] presents a numerical example, 
and the paper concludes with three technical appendices. 

Although presented here only for equation ([1]) discretized in time using 
a piecewise-constant DG method, the fast summation algorithm does not 
depend in any essential way on this specific choice and could be used for many 
other time stepping procedures for evolution problems with memory. The key 
requirements are that the quadrature weights are computed via local averages 
of the kernel and that, away from the diagonal, the derivatives of the kernel 
exist and decay appropriately. Also, the approximation scheme of Section [3], 
based on Taylor expansions, is only one possibility, chosen because it is simple 
to analyse for the kernel in ([T]). We could instead use an interpolation scheme 
that requires a user to supply only pointwise values of the kernel. 

2 Numerical method 

We define Ui,{t) = f^^^/Tlu) and denote the Riemann-Liouville fractional 
differentiation operator of order 1 — z/ by 

d /■* 
Bv{t) = {uy *v)t = J- I ujy{t - s)v{s) ds, 

and let ut = du/dt so that, suppressing the dependence on x, 

ut + BAu = f{t) for < t < T, with u{0) = uq. (10) 

We also denote the inner product in ^2(^2) and the bilinear form associated 
with A by 

{u,v) = / uvdx and A{u,v) = / KVu-Vvdx. 
Jn Jq 



The weak form of f llOl) is then 

{ut,v) + A{Bu,v) = {fit),v) for all test functions v G L2((0,T), ij^), 

where H^ is the Sobolev space Hq{Q) in the case of Dirichlet boundary 
conditions ([2]), and H^{Q) in the case of Neumann boundary conditions ([3]). 
The numerical solution U{t) ^ u{t) is a piecewise-constant function of t, 
with coefficients belonging to a continuous, piecewise-linear finite element 
space Sh C H^. Thus, U generally has a jump discontinuity at each time 
level tn, and we adopt the usual convention of treating U(t) as continuous 
for t in the half-open interval /„ = (t„_i,t„], writing 

u- = u{t^) = uit-), ui = u{tt), [ur = ui- u\ 

The DG time-stepping procedure is determined by the variational equation 
{Ul-\v:-')+ [ [{U'{t),V{t)) + A{BU{t),V{t))]dt 

Jin 

= {U--\V:-')+ f {f{t),V{t))dt, 

Jin 

which must hold for every piecewise-constant test function V with coefficients 
in Sh- In our piecewise-constant case, U!l~^ = f/" so this variational equation 
reduces to finding f/" G Sh such that 

,x) + A(i3"[/,x) = (r,x) for all xe 5,, 

where /" = k~^ Jj f{t) dt and 

B^U =^ I BUit) dt = ^ (pnnU'^ - Y, Pr.,U^^ ■ (H) 

^ J In ^ \ 7^1 ^ 

Thus, the vector of nodal values of U"' satisfies (|6]), with weights given by 
dZD and dH]). 

Our fast algorithm approximates B"'U with 

ra-l 

'] (12) 



B^U = ^hnnU'' -J^i^njU^y 



where (3nj ~ Pnj for 1 < j < n — 1, and yields [/"" satisfying the perturbed 
equation 



,x)+A(i3"f/,x) = (r,x) for all xe^,, (13) 



with t/o = t/°. 



Theorem 2.1. //, for every real-valued, piecewise- constant function V , 

N 

5^A;„(^"r)V">0, (14) 

n=l 

then the perturbed problem (TT3|) is stable: 

n 

||^"|| < ||^°|| +2'^kj\\f\\ forl<n<N. 

Proof. This estimate follows by a simple energy argument [5l Theorem 1] . D 
In Appendix \^ we prove a lower bound 

TV N 

J2 kn{B''V)V^ > p,T^-' J2 kniV'f, (15) 

n=l n=l 

where the constant p,^ > depends only on z/, allowing us to show the 
following. 

Corollary 2.2. The perturbed DG method flT^ is stable if 

n-l 

Y^ Wnj - Pnj I < P.T'-^K for2<n<N, 
i=i 

and 

N 

J2 \~Pn, - Pnj\ < P.T^-% forl<j<N-l. 

n=j+l 

Proof. Write 

N N N n-l 

A = J2 kniB^V)^- - J2 kn{B^V)Vr, = J^ 5^(/3„, - Pn,)VjV^, 
n=l n=l n=2 j=l 

and observe that since |\/^V"| < ^{V^f + ^{Vf, 

N-l N N n-l 

TV 



j=l n=j+l n=2 j=l 



<PuT''-'J2^n{Vn'- 



n=l 

D 



To estimate the effect on the solution f/" of perturbing the weights (3nj, 
we introduce the piecewise-constant interpolation operator 

Ilu{t) = u(tn) for tn-l < t <tn- 

In the next result, for simplicity we treat the semi-discrete DG method in 
which there is no spatial discretization; the fully-discrete method could be 
analysed using the methods of McLean and Mustapha O Section 5] . 

Theorem 2.3. Assume that Sh = H^ and U^ = uq. If the perturbed equa- 
tion flT3l) is stable, and if 

n j~l 

j=i 1=1 

then 

n 

||f/"-M(t„)|| <2e„ max \\Au{tj)\\ +2^" kM^'Aiu -I[u)\\. (16) 



Proof. Since U satisfies flT3|l and u satisfies 

'u{tn) -M(t„„i) 



the difference W^ = U'^ — W^ satisfies 



x)+/\{B-u,x) = {r,x) for all xe^,, 



, ^\ + A(i3"iy, x) = {r. X) for all x G ^h, 

where g"" = (S" - B'')UAu + B'^A{u - Tin). Noting that Vr° = 0, stability 
of the perturbed equation implies that 

n 

and from flTTl) and flT2l) we see that 



{B^ - B^)liAu = kJ^Y^ih - l^ji)Au{tj). 

1=1 

D 
The second term on the right-hand side of flT6|l does not involve /3„j and 

is o(r) [g. 



3 Taylor approximation of the weights 

Recall from ((Tj) that /3„„ = ui+i^{kn), and from (jHj) that for 1 < j < n — 1, 

I3nj = / [uj^{tn-i - s) - u^{tn - s)] ds = - / u„_i{t - s) dt ds. (17) 



Denote the midpoint of the interval /„ by t„_i/2 = |(in-i + tn), and define 

ffc/2 
'-k/2 



rk/2 
B^{t, k) = u^{t + s)ds = uJi+^{t + \k) - uJi+^it - \k) 

J -kl2 



for t > k/2 and — oo < /i < oo. The approximate weights f3nj are determined 
as follows. See Appendices [B] and O for notes on the stable evaluation of 
these quantities. 

Theorem 3.1. Let < r] < 1 and suppose that < a < b < c < d <T with 
Ij^ia,b], 4C(c,ci], ^J<V- (18) 

Denote the midpoint of [a, b] by s = ^{a + b) , and define 

(Ppn = (-l)^fiz.-p(t„,-l/2 - S, kn), Ippj = Bp{tj_i/2 - S, kj) , 

and 

r 

Pnj / J (PpnWpj • 

p=l 

Then 

Proof. Taylor expansion about s gives 

r-l 

(^u-i{t - s) = ^(-l)Pu;j,_i_p(t - s)uJi+p{s - s) 

p=0 



+ {-ly / ujr{s - y)uy_i_r{t - y) dy, 
so, by flT7|l . integrating over t E In and s G Ij gives /3„j = /3nj + Emj with 

Ernj = {-ly^^ / / UJr{s - y)uJu^l-r{t - y) dy ds dt. 

J In J li J S 



In -J Ij 
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Since y E Ij (^ (a, b] and t E In ^ (c, rf], we have t — y > c — b, so 

dy 



-Ir— 1 
S — S 



\E |<^^-^ 

™^l- |rfzy-r-l)L/,„J,^, r! 



(t - 2/)2-- 



dsdt. 



If s < s, then y<s so t — y>t — s. However, if s < s then < y — s < 
s — s < ^{b — a) so 



t - y = {t - s) - {y - s) > {t - s) - Ub - a) = {t - s)( 1 



lib -a] 
t-s 

Also, t — s > c — b because s G [a,b], and assumption (TTSj) imphes that 
t-y > (t-s)(l- ir]). Thus, 



dy 



(t-y) 



2-u 



dy 



< 



\s — s\ 



(l-|r7)2- (t-s)2-' 



and since |s — s| < |(6 — a) < ^rjic — b), it follows that 



2-u 



ivm' 



dsdt 



'^- \2-r^J r\\T{u-r-l)\JjJj^{t-sy-^- 

The identity r(z/ — 1) = {u — 2){i' — 3) ■ ■ ■ {u — r){u — r — l)r{u — r — 1) shows 
that 



T(u-1] 



rlTiu — r — 1] 



2 — u 3 — V r — V , 
< —^ (r + 1 - z/) < r + 1, 



r 



and because r(z/ — 1) = r(z/)/(z/ — 1) < 0, 
\Ernj\<2^-'{r + l){ri/2) 



(t-s) 



u-2 



I. Ju r(^ - 1) 



dsdt 



U 



Whereas ■j/'pj depends on [a, b] (via s), in the following alternative expan- 
sion ifj^j depends only on p and j. However, the sum that defines 0*^ is 
susceptible to loss of precision. 



Corollary 3.2. If we define 

{-sy-p 



E 



q=p 



■'qn 



and ippj = Bp{tj_i/2, k_ 



Vi 



then 



p=0 
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Proof. The binomial expansion gives 



[s-s)P-^ . x:^ (-s)P-i f s"-! 



"^'^ J I, (p-1)! "^^ ^,iP-q)diAQ-iy-'^' 



so 



P-9\ /" s?-! 



^-^ = E E^"?;— 7;^ / 7;73w^'- 



9=1 ^p=q 



{p-qVj JiAl-^V- 



D 



4 Cluster tree 

We introduce the notation 

C(j, n) = {Ij, /j+i, ...Jn} ioT l<j <n<N, 

and refer to any such set of consecutive subintervals as a cluster. A cluster 
tree for the mesh = to < ti <■■■< tj\f = T is a. tree T having the following 
properties: 

1. each node of T is a cluster; 

2. the root node of T is C(l, A^); 

3. each node is either a leaf or else equals the disjoint union of its children. 

Let £ denote the set of leaves in the cluster tree T, and observe that for 
any two distinct nodes Ci 7^ C2 of T exactly one of the following holds [21 
Remark 3.4]: Ci C C2; C2 C Ci; or Ci n C2 = 0. 

In the obvious way, we define the length of a cluster C to be the length 
of the underlying interval |J C; thus, 

Len(C) = tn — tj-i if C = C(j, n). 

The distance between two clusters is the Euclidean distance between the 
underlying point sets in M, 

Dist(Ci, C2) = inf { \t - s\ : t e[jCi and s G |J ^2 }, 

so in particular, Dist(C(ji,ni), C(J2,'n'2)) = tj^^i — tn^ if j2 > ni. We define 
the history of a cluster to be the entire preceding half-open interval: 

History(C) = (0, tj_i] if C = C(j, n). 
10 



Given a leaf L G £, and a parameter rj in the range < r/ < 1, we say 
that a cluster C is {L,ri)-admissable if 

IJ C C History(L) and Len(C) < r7Dist(C, L). 

Thus, the conclusions of Theorem 13.11 hold for Ij G C and /„ G L with 
IJ C = (a, 6] and IJ L = {c,d]. Notice that if C is (L, ?7)-admissible, then so 
are all the descendents of C. 

An {L,rj) -admissible cover is a set C of clusters such that 

1. each cluster C G C is a node of T; 

2. [J{I -.1 eCeC} = History(L); 

3. if Ci, Ca G C with Ci ^ Ca, then Ci n Ca = 0; 

4. if C G C then either C is (L, ?7)-admissible, or else C G £. 

A trivial example of an (L, r7)-admissible cover is the set of leaves in the 
history of L, that is, 

C = { L' G £ : L' C History(L)}. 

Lemma 4.1. All {L,r]) -admissible covers contain the same non- admissible 
leaves. 

Proof. Let Ci and Ca be (L, 77)-admissible covers, and suppose that L' G CiHC 
is not (L,77)-admissible. Since L' C History(L) = [J{I : / G C G Ca }, there 
exist / G C G Ca such that I intersects at least one interval /' G L'. Since L' is 
a leaf of T, it follows that L' C C and hence C is not (L, r^)- admissible. Thus, 
C must be a leaf, because Ca is an (L, ?7)-admissible cover, and we conclude 
that C = L'. D 

Algorithm 14. II defines a recursive procedure, divide{C, C, L, rj), that we use 
to define a family of clusters A^ ( L) , as follows: 

C = C(l,Ar),C = 

divide{C,C, L, 77) 

M{L)=C 
This construction is a modified version of [21 (3.8b)]. 

Theorem 4.2. A^(L) is the unique minimal {L,r])- admissible cover. 
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Proof. Suppose that C is an (L, r7)-adinissible cover. Let M G A^(L) and let 
C be any node of C that intersects M. If M is a leaf, then M is not {L,ri)- 
admissible so M = C by Lemma 14.11 If M is not a leaf, then either C is a 
leaf, in which case C C M, or else C is (L,?7)-admissible and C C M by the 
construction of A^(L) using Algorithm 14.11 Hence, in all cases C C M, so 
either C = Mil) or else #A^(L) < #C. D 

Algorithm 4.1 diVide(C, C, L, 77) 

Determine a, b, c, d such that IJ C = (a, b] and IJ L = (c, d\ 

C = 

if a < c then 

if 6 < c and b — a < ri[c — b) then {C is (L, r7)-admissible} 

C = CU{C} 
else if 6 < c and C E C then 

c = cu{c} 

else 

for all C G Children (C) do 

divide{C' ,C, L, 77) 
end for 
end if 
end if 

The minimal admissible cover A^(L) is the disjoint union of the sets 

Near(L) = { C G M{L) : C is not (L,-?])- admissible }, 
Far(L) = { C G A^(L) : C is (L, r/)-admissible }. ^ ' 

Given n, there is a unique leaf L = L„ containing /„, and we define 0p„(C) and 
ippn{C) using the formulae of Theorem 13.11 with (a, b] = [JC and (c, <i] = IJ L„. 
We then define 

Pnj = /3nj if /j G L„or Jj G CGNear(U), (20) 



and 
so that 

n-l 



/3„, = J]0p„(C)^p,(C) if/,GCGFar(U), (21) 

p=i 



J2^n,V' =^n{\-n,V)+ J^ ^n{C,V) + ^ S„(C,V), (22) 

i=l CeNear(L„) CeFar(L„) 
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where 

i::^{Q,V) = Y, Pn,^' and Y:n{<^,V) = Y,~Pn,y'- 

j<n~l 

To evaluate S„(C, V), we compute 

r 

S„(C,V) = ^0p„(C)^p(C,r) where ^^(C, V^) = ^ V^p,(C)V^'. 
p=i /jGC 

The resuhs of Sections [2] and [3] now yield the following estimate for the 
additional error incurred by using the approximating sum (122|) . Recall that 
Pu is the constant appearing in the lower bound (TT5l) . 



Theorem 4.3. Define f3nj according to Theoreni \3.1\ fl20l) and fl2Tl) . and put 
/Cmin = mini<„<jv ^n- r/ie perturbed problem flT^ ^s stable if 

(r + l)(77/2)^ < 2'^-2r(z/ + l)p,(A;^i,/^)l-^ 

m which case the error estimate (ITBl) /or t/ie semidiscrete DG method holds 
with 

en = Yry^{r + l){'n/2ynk'' forl<n<N. 



Proof. Recalling (|8]), we have 



^^"' X V r(^) r(.) )"' 



and 



r(z/ + i) -r(z/ + i) 



'•^^ (i^-^r-'-iT-jTlds 






r^/ 



kf-[iT-t,.,r-iT-t,r] ^ fcj 



r(i/ + i) -r(z/ + i)' 

Thus, by Theorem 13.11 



r(z/ + i) 
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and 

n=j+l ^ ' 

Therefore, Corollary 12.21 shows that the perturbed scheme is stable if 

22-'^(r + l){r^/2r ^ J^_^ < p.T'-'K for 1 < n < iV, 
and since 

n j-1 ^ ^ k^ 

E E Wr, - Pn3\ < l'-\r + l)(,/2)^ Yl fUjfTY 

j=l 1=1 j=l ^ ' 

the error bound follows at once from Theorem 12.31 D 

Since A^^^ < Ck^^^/T because the mesh is quasiuniform, we have stability 
if (r + l){ri/2Y < CN'^^^, and the error factor satisfies 

en = 0{k) if {r + l){r]/2Y<CN''-\ (23) 

The alternative expansion from Corollarv 13.21 gives 0p„(C) and ippj such 
that 

r 

^nj = Yl 'I'lniQ^pj if ^j- e C G Far(L„), 
p=i 
so 

r 

i:4C,V) = Y<P;niQ%iC,V) where %{C,V) = Y^p.VY 

P=l /j6C 

If C is not a leaf, then C is the union of its children, so 

C'eChildren(C) 

Thus, once we compute '^*{C,V) for every leaf C of T, we can compute 
\l/p(C, V) for the remaining nodes C by aggregation. 

5 Computational cost 

We now seek to estimate the number of operations required to evaluate the 
right-hand side of (122|) . Let Gen(C) denote the generation of the node C G T, 
defined recursively by 

Gen(C)=0 ifC = C(l,iV), 
14 



and 

Gen(C) = Gen(Parent(C)) +1 if C ^ C(l, A^). 

Put 

Mi{L) = {CeMiL): Gen(C) = i}, 

and note the A^o(L) = since the root node of the cluster tree cannot 
be (L,?7)-adniissible. We formlate the following regularity condition for the 
cluster tree. 

Definition 5.1. For integers G > 1 and Q > 2, we say that T is {G,Q)- 
uniform if, for some constants < A < A and for every node C E T, 

i. < Gen(C) < G; 

2. 7^ Children(C) = Q whenever C ^ C; 

3. XTQ-^ < Len(C) < ATQ-^ whenever C G Ge; 
I [jge=[0,T]forO<i<G. 

For example, given a uniform mesh with N = 2^ subintervals and G < P, 
recursive bisection of [0, T] for G generations leads to a {G, 2)-uniform cluster 
tree in which each leaf contains N/2^ = 2^~'^ subintervals. 

We assume henceforth that T is (G, (5)-uniform. Since ij^Q^ = 1 and 
i^Ge+i = Q X i^Qi, we see that ^Qi = Q^, implying that 

i^r = J2Q' = ^T_~, ^ < 2(#^) - 1- (24) 

Also, £ = ^G so 

The next result shows that 4fM{L) = 0{r]'^QG). 

Lemma 5.2. Suppose that L G £ and C G A^£(L). // |J C = (a, 6] and 

IJ L = {c,d], then 

c - (1 + r]-^)ATQ-^+^ <a<b<c- r]-^XTQ-^ when l<i<G-l, 

whereas 

c-{l+r]-^)ATQ'^ <a<b<c when^ = G. 

Therefore, 

^ ' - X 1(1 + 7^-1) for£ = G. 
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Proof. Ifl<£<G' — 1, then C is (L, ?7)-adniissible so 

XTQ-^ < Len(C) = b - a < r] Dist(C, L) = r]{c - b) 

and thus b < c — rj~^XTQ~^. Suppose for a contradiction that a < c — AT(H- 
r/-i)Q-^+\ and let C = Parent(C). If J C = (a', 6'], then a' < a < b < b' 
and Gen(C') = £ — 1, so 

y <a' + Len(C') < a + ATQ-^+^ <c- 7]-^ATQ~^+^ 

and 

Len(C) < ATg-^+i < 77(0 - b') = r] Dist(C', L), 

showing that C' is (L,r7)-adniissible, which is impossible because C G A^(L). 
Thus, 

i^Mi{l) X XTQ-^ < J2 Len(C) < (1 + r]-^)ATQ-^+^ - r]-^\TQ-^ 

C€Me{L) 

and #Me{\-) < (1 + r7-i)(A/A)g - r]-\ 

Now let £ = G and suppose for a contradiction that a < c — (1 + 
77-i)ATg-^. Since 

b<a + Len(C) < a + ATQ"^ < c - rj-^ATQ-^ 
it follows that 

Len(C) = 6 - a < r]ATQ-^ < n{c ~ b) = rj Dist(C, L), 
so C is (L,-?])- admissible, which is impossible because C is a leaf. Thus, 
#A^g(L) X XTQ-^ < Yl Len(C) < (1 + t]-^)ATQ-^ 

and#7WG(L)< (l + r/-i)(A/A). D 

As a straight forward consequence, we obtain the desired operation counts. 

Theorem 5.3. IfT is {G,Q)- uniform, then the right-hand side of (1221) can 
be computed for 1 < n < N in order rri~^MN{QG + NQ~'^) operations. 

Proof. If Gen(C) = i, then the number of subintervals in C is NQ"^ so 
computing S„(C) requires 0{NQ~^M) operations. Since 

#Near(U) = #A^c(U) = O(r^) 
16 



and i = G whenever C G Near(L„), we see that the total cost for all of the 
near-field sums is 0{r]^^NQ^'^M). The sum S„(C, V^) costs 0{rM) opera- 
tions, and since 



G-l 



#Far(L„) = J^ #A1,(U) = O(r'QG) 



1=1 



the total cost for all of the far-field sums is 0{ri^^QGrM) operations. In ad- 
dition, computing \E'p(C, V^) for every C with Gen(C) = i costs 0{NM) op- 
erations, so computing this sum for 1 < p < r and < i < G — 1 costs 
0{rGNM) operations. Thus, the overall cost for N time steps is of order 
A^ X {r]-^Q-^NM + r]-^rQGM) + rGNM operations. D 

If we choose G = P = logg A^ so that A^ = Q'-' and each leaf contains 
only a single subinterval, then the cost is 0{rri~^QMN\ogN), as claimed in 
the Introduction. In practice, the overheads associated with the tree data 
structure mean that it may be more efficient to choose G < P. 

6 Memory management 

From (121|) we have iJ^iT \ C) < Q'^ , which implies that to store \E'p(C, f/) 
for all C G T \ £ and 1 < p < r we require 0{rQ^M) memory locations. 
Storing [/" for 1 < n < A^ requires a further 0{NM) locations. However, 
at the nth time step only a small fraction of this memory is active, in the 
sense that the data it holds play a role in computing [/". Figure [1] illustrates 
a (Q, G)-uniform cluster tree with Q = 2 and G = 6. The black cluster 
is the current leaf L„, and the red clusters belong to the minimal (L^,?])- 
admissible cover A^(L„). As we will now explain, memory associated with 
the green and red clusters is active, whereas that associated with the blue 
and magenta clusters is not. 

For each cluster C G T, either there is no n such that C G A^(L„), or 
else there is a unique smallest n = njnin(C) such that C G A^(L„). Moreover, 
if C ^ A^(L„) for some n > njnm(C), then an ancestor of C must belong 
to Ai{Ln) and we have C ^ Ai{\-n') for every n' > n. Hence, there is also a 
unique n = ramax(C) such that 

C e M{ln) if and only if nmin(C) < n < nmax(C), 

so, if C is not a leaf, the sums 

^{C, U) = Vl/(C) = [vI/i(C), VI/2(C), . . . , ^>r{Q)] 
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Figure 1: Cluster tree with L„ shown in black, the minimal admissible 
cover A4{Ln) in red and the other active clusters in green. The blue clusters 
are not yet active and the magenta clusters are no longer active. 

contribute to the far-field sum S„(C, U) if and only if ninin(C) < n < nmax(C). 
We can therefore deallocate the 0{rM) memory locations used to store ^E'(C) 
once U"' has been computed for n = ramax(C). 
For each n, define a subtree 



Tn = {C E T\ C -.IJC intersects In }, 



so that '^p{C,U) includes a term in U^ if and only if C G 7^. In Algorithm [6TT1 
after computing [/" we update all far-field sums \E'p(C) with C G Tn, so that 
[/"" is subsequently needed only for computing near-field sums. In this way, 
we can deallocate the 0{NQ~'^M) memory locations used to store U^ for 
Ij G L once f/" has been computed for n 



^max(L)- Algorithm 16.21 defines a 
recursive procedure free(C, V) that deallocates the memory associated with 
the children of C, and with their descendants if not already deallocated. 

Theorem 6.1. The number of active memory locations used during the ex- 
ecution of Alaorithm \6. 1\ is never more than 0{rri~^QGM). 

Proof. Suppose that the memory associated with C is active during the nth 
time step, and that |J C = (a, b] and |J L„ = (c, d]. (So in Figure [H C is green 
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Algorithm 6.1 Time stepping and memory management. 
for n = 1 to A^ do 

Find M.{\-n) using Algorithm 14.11 
for slM C e M{Ln) \ C do 
for all C' a child of C do 

free(C', U) 
end for 
end for 

Compute B^'U using (^ 
Allocate f/" and solve (Il3]) 
Write [/" to disk 
for all C G 7^ do 

if ^E'(C) is not allocated then 

Allocate ^^(C) and initialize to 0. 
end if 

for all p G {1, 2, . . . , r} do 
^p(C) = ^p(C) + ^p„f/- 
end for 
end for 
end for 



Algorithm 6.2 free(C, V) 



if C G £ then 
for all Ij G C do 

Deallocate V^ 
end for 
else if \1/(C) is allocated then 
for all C' a child of C do 

free{C, V) 
end for 
Deallocate ^(C) 
end if 
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or red or black.) If Gen(C) = £, then by Lemma [5.2^ 

c - (1 + r]'^)kTQ-^+^ <a<b<d + KTQ-\ 

and since d — c = Len(L„) < ATQ^^ and Len(C) > XTQ^i, we see that the 
number of such clusters is at most 

{d-c)+ ATQ-'[1 + (1 + r]-')Q] ^ A 



XTQ- 



<j{q'~'' + i + {i + v-')q) 



<jii + v-')iQ + i)- 



Storing \l'p(C, U) requires rM memory locations, so the desired estimate 
follows after adding the contributions for 1 < i < G. D 



Since G < P = logqN, Theorem 16.11 justifies the claim in the Introduc- 
tion that the memory requirements are proportional to M log N. Theorems 
15.31 and 16.11 show that — for a given choice of M and A^ and a given cluster 
tree — the computational cost, both with respect to the number of opera- 
tions and to the number of active memory locations, is proportional to r/rj. 
At the same time, by Theorem 14. 3[ to achieve the desired accuracy we must 
ensure that (r + l){ri/2Y is sufficiently small. The next result shows the 
relation between r and rj that is optimal in the sense of achieving a given 
accuracy for the least computational cost. 

Proposition 6.2. For a given 6 > 0, the ratio r/rj is minimised subject to 
the constraint (r + l){r]/2y' = 6 by choosing 

'; = 2exp(-^)=2(^)'", (25) 

Proof. Introducing the Lagrangian L = rri~^ + fi{r + l){r]/2Y, we obtain the 
necessary conditions 

r]^^ + fi{r]/2Y[l + {l + r)\og{r]/2)] = and -rr]'^ + fi{r + l)rr]'^-^2-'^ = 0, 

so fi{r]/2Y = r]-^/{r + 1) and r + 2 + (r + 1) log(r7/2) = 0. D 

Thus, we should choose successive values of r until the second inequality 
in (123 p holds, with rj given by (I2S]) . Since rj'^ < e^/2, the computational 
cost is then proportional to r. 
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Slow 




Fast 




r 


— 


4 


5 


6 


V 


— 


0.6024 


0.6228 


0.6378 


Error 


0.129E-03 


0.136E-02 


0.129E-03 


0.129E-03 


Setup 


49.6 s 


0.57s 


0.57s 


0.62 s 


RHS 


910.9 s 


15.45 s 


17.68 s 


20.55 s 


Solver 


7.2 s 


6.96 s 


6.84 s 


6.68 s 


Total 


967.7 s 


22.97s 


25.10 s 


27.85 s 



Table 1: Performance of slow and fast methods with N 
and M = 6241 spatial degrees of freedom. 



16000 time steps 



7 Numerical example 



Consider a simple test problem in d = 2 spatial dimensions, with u = 1/2, 
T = 6, fi = (0, 1) X (0, 1) and homogeneous Dirichlet boundary conditions ([2]). 
We take K = l/(27r^) so that the smallest eigenvalue of the elliptic opera- 
tor Au = —V ■ (KVu) is All = 1- We choose the initial data and source 
term 

uo = V'li and f{t) = (1 + simTt)Lpii, 

where (pu{x) = (sin 7rxi)(sin 11x2) is an eigenfunction oi A with eigenvalue An. 
The exact solution of ([T]) then has the separable form u{x,t) = unit)ipii{x), 
and we can compute the time-dependent factor 'Uii(t) to high accuracy by 
applying Gauss quadrature to an integral representation [51 Section 6] . More- 
over, the regularity estimates (jl]) hold with a = 1, so by ([5]) the L2-error in f/" 
is of order {k + h"^) lg{tn/ti). 

Table [1] shows some results of computations performed using a single- 
threaded Fortran code running on a desktop PC with an Intel Core-i7 860 
processor (2.80GHz) and 8GB of RAM. In all cases the spatial discretization 
used bilinear finite elements on a uniform 80 x 80 rectangular mesh, so the 
number of degrees of freedom was M = 79^ = 6241. We solved the linear 
system ([6]) using fast sine transforms. Taking A^ = 16000 time steps, we first 
computed the (slow) DC solution U and then the perturbed (fast) solution U 
for r = 4, 5, 6 choosing 77 as in Proposition l6.2[ The table shows the maximum 
nodal error maxi<„<7v,i<m<M \U^ — u{xm,tn)\, and also the CPU times in 
seconds, broken down into three parts. The setup phase covers computing 
the f3nj or f3nj, and for the fast method the cost of constructing the cluster 
tree and admissible covers. The RHS phase covers the computation of the 
right-hand side of ([H]), and the solver phase is the total CPU time used by 
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the elliptic solver. 

The cluster tree was [Q, G)-uniform for Q = 2 and G = 10, so there were 
2*^ = 1024 leaves. We see from the table that if r = 5 then the fast summation 
algorithm evaluates the right-hand side (RHS) in 17.7 seconds, compared to 
911 seconds for a direct evaluation, while maintaining the accuracy of the 
DG solution. 

A A lower bound 

For any real-valued, piecewise-constant V, 



^kn{B''V)V''= f BV{t)V{t)dt and J^kniV'f = f 

n=l "^0 „=i Jo 



vitydt 



so the next theorem shows that (IT5|l holds. 
Theorem A.l. If v : [0,T] — > M zs piecewise C^ then 



f Bv{t)v{t)dt>p^T''-^ f v{tfdt, 
Jo Jo 



where 



p. = vr-^|i-^sin(». (26) 



Proof. The assumption that v is piecewise C^ ensures Bv is continuous except 
for weak singularities at the breakpoints of v. Using the substitution t = tT 
for < r < 1, we see that it suffices to deal with the case T = 1. Denote the 
Laplace transform of u by 



ulz 



/■oo 

/ e-''u{t)dt ior^z>0, 
Jo 



and observe that, because uJu{z) = z ^ , ii we extend v by zero outside the 
interval [0, 1], then 

Bv{z) = zCji,{z)v{z) = z^'^'viyz). 



Applying the Plancherel Theorem, and noting that v{z) = v{z) because v is 
real-valued, we have 

u{t)v{t) dt= -- u{iy)v{-iy) dy. 

^T^ J-oo 
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In particular, 



1 -| /"OO 

v{tfdt = - I \v{iy)\^dy 
^ Jo 



(27) 



and 



Bv{t)v{t) dt = / Bv{t)v{t) dt= — {tyy-''v{iy)v{-ty) dy 
Jo 27r 



siniTTz/ /"^ i_^ 



TT 



(2^ 



yi-1t)(zi/)|^t/|/. 



^'y'v{t) dt 



2 ^1 

< / vit^dt 
Jo 



The estimate 

\H^y)\'< 

implies that, for any e > 0, 

/ \v{iy)\^dy< e / v{tf dt 
Jo Jo 

and therefore by ([2 



1 g /•! 1 

v{tfdt<- I v{tfdt+-l \v{iy)\^dy. 







Thus, for < e < vr. 



TT 



vr 



1 



1 1"^ f 

v{tfdy<- / (i//e)i-1{)(zi/)rrfi/<- 



U-l /-OO 



y ''|t)(2y)| c/y. 



which, in combination with f l28p . implies that the desired inequality holds 
with 



p^ = ei-(^_e; 



sm gVrz/ 



vr 



The choice e = vr(l — z/)/(2 — z/) maximises p,y and gives the formula fl26p . D 



B Computing the weights 

Since the diagonal weights present no difficulty, we assume throughout this 
appendix that 1 < j < "n. — 1. Denoting the distance between the centres of 
Ij and /„ by A,„j = t„_i/2 — tj-i/2^ we see from flT7|) that 



A 



raj 



3 1 rf^nj^ 



ujy-i{^nj +t + s)dtds 



31 <^n I ^ 
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and so 

f'kj/2 pk„/2 

/3nj = - 5^-1 (A„j + s, kn) ds = - B^^i{Anj + t, kj) dt. (29) 

J-kj/2 J-kn/2 

Although we can easily evaluate these integrals analytically, the resulting 
expressions are susceptible to loss of precision when kj and kn are small 
compared to A„j. 

Consider the problem of computing 

B^{t, k) = UJi+^{t + i/c) - UJi+^{t - Ik) 

when k is small compared to t, so that we have a difference of nearly equal 
numbers. The C99 standard library provides the functions expinl(x) and 
loglp(x) that approximate e^ — 1 and log(l + x) accurately even when x is 
close to zero, so we can avoid the loss of precision by noting that 



Bf,{t, k) = uji+f,{t + lk)D^ ( j where D^ 



{x) = l-{l-xY, (30) 



and evaluating D^{x) as — expnil(yUloglp(— a;)). 

However, even though we can compute -B^(t) accurately, we still face the 
problem that 

Pnj = B^[/\nj — 2"^J5 kn) ~ B^[/\nj + 2"^j' "^"/ \^^) 

is again a difference of nearly equal numbers, as is the alternative formula 

Pnj ^u\^nj 2 ™' j) ^u\^nj ' 2 "■' i/' 

or equivalently 

Pnj = ^l+u{tn — tj)Dj,\ — I — UJi^iy{tn — tj^i)Dj,\ 



^n j / \^n i^l 



^l+u{tn~l — tj-l)D,y\ j — UJl+,y{tn — tj_i)D,y 



''ji— 1 j — i/ \ ''?! J — 1 

When kj is small compared to Anj, the following sum gives a more accurate 
value for the weight /3„j. 



Theorem B.l. Ifl<j<n— 1, then there exists tnj € Ij such that 

n-l/2 — 

(2j9+l)!22p '"J (2r + l)!22^ 



n _ ^^ Bi,_2p-l{Anj, kn) i2p+l ^u-2r-l,j{tn-l/2 ^nj^'^n) ,2r+l 
Pnj - - 2_^ r9r,^ni92p '^i r9T.^ni92r '^J 

p=0 
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Proof. We use the first integral representation in (12^ . The Taylor expansion 

-Ojy— l(A„j + S, Kn) = / ^ -Di/-l-p(A„j, Kn) 7 
p=0 ^' 

Jo (2r-l)! 



implies that 






' 2 J P=0 

with the error term given by Er = E^ + i?~, where 



/•2'^i /■■^ fs — t)2r-l 
^-^= / / ^77^-^TTr^--l-2r(Ani±t,fcn)c?tC?S. 

By the Integral Mean Value Theorem, there exists 9^^ G [0, \kj\ such that 



^1 rs ( c. +^2r-l 



Ef = B,_^_2r{Kj±e%,K) 1^' [ %-^^dtds 



-^'--/o io (2r-l)! 



= 5^-i-2r(A„j ± ^^^., kn) ^2r + l)y 

Since A„j ± 9^- = tn-1/2 - {tj-1/2 =F 0^^) and tj_i/2 =F 0^^ G J^, by the Inter- 
mediate Value Theorem there exists t* • G /j such that 

Bu-l-2r{^nj + ^„j, ^n) + -Biy-l-2r( A„j — ^„j, A;„.) = 2-Bi^_i_2r (^n-1/2 " ^nj' ^"j- 

D 



By starting from the second integral representation in fl29|) . we obtain 
an expansion in odd powers of kn, instead of kj. For practical meshes, we 
generally have kj < kn, so the series in the theorem will be preferable. 

To determine the speed of convergence of the series, denote the pth term 
by 

7 _ _-Di/-2p-l(,A„j, K„j r,2p+l 

^~ {2p+l)\2^p ^ ' 

and note that since D_^{x) = —D^{x)/{1 — x] 



M 



-B^-2p-l{^nj, K) - (^iy-2pi^nj + 2^") n _ ^)2p+l-u ^°^ ^ ~ ^ ■ + -k ' 



'■nj ^ 2 
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We find that the ratio of successive terms is 

6p+i _ 1 2p + 2-z/2p+l-z/ D2p+3-iy{x) ( kj 



bp 4 2p + 3 2p + 2 D2p+i-uix)\Anj-lkr, 
and, since D^{x) — )■ 1 as yU — )■ oo for < a; < 1, 



hm ^'^' -( ^^ 



p-i-oo hp \2A„j — kn 

For instance, in the case of a uniform grid t„ = nk, this hmiting ratio is 
[2(n — j) — 1]~^, giving acceptable convergence for j < n — 2. li j = n — 1, 
then the hmiting ratio is kn-i/{kn-i + kn)-, so the convergence is relatively 
slow. We see from (131 p that 

= u;i+,,(/i;„) [l + x'' - (1 + x)''] , for X = k^-i/kn, 

and from symmetry we may assume kn-i < kn and thus < x < 1. To 
evaluate the difference in square brackets, we write 

n^ ,u ,^u , / I0g[(l+X)--1] 

(1 + xj = 1 + y wliere y = exp I 

computing (1 + xY — 1 as expnil(z/ loglp(x)). In this way, 

1 + ^- _ (1 + ^)- = ^'^ _ ^- = y- [(^/^)- _ 1] , 

and we compute {x/yY — 1 as expml[z/log(x/y)]. 

We remark that in the special case z/ = 1/2 one can evaluate (5nj niore 
easily. Firstly, 

Bi/2{t,k) 



r(3/2),/^Ti^+,/^rT^ 



and if we write /?*<> = \ ^nj * \kj o \kn for *, o G {+, — }, then 

Pnj = Bi/2{Anj — 2^j, ^n) " -Bl/2(A„j + ^kj, kn) 

k„ f I 1 



r(3/2) \R^+ + i?__ i?++ + R^ 

kn {R++ - i?„+) + {R+- - R-) 

r(3/2) (i?_+ + i?__)(i?++ + /?+_) 

itnl^j J- / I 



r(3/2) (i?_+ + R—){R++ + i?+_) Vi?++ + ^-+ ^+- + ^- 
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C Computing the Taylor approximations 

Recall from Theorem 13.11 that 

4>pn = (-l)^5j,_p(t„_i/2 - S, kn) 

SO fl3Q|) gives 

k 
4>pn = {-lyuJu-p+iitn - s)D„_p{x) where x = "^ _ . 

Since D^_p{x) = — (1 — xY~^Dp_y{x) and 1 — x = (t„-i — s)/(tn — s), we 
have 

4>pn = KpnDp^i,{x), where Kpn = {-ly^'^^f^ — — , 

and the Kpn can be computed via the recursion 

Kin = :frr\ ^^d Kp+i,„ = Kp^n tor 1 < p < r - 1. 

r(z/) tn-l - S 

Likewise, 

i)pj = Bp{t^^i/2 - s, kj) = - [{tj^i/2 -s + ^kjY - (t,_i/2 -s- IkjY] 



giving ipin = kj and 



-^ ((t,-i - s-)V^p, + ^ (t, - 5-)^) for 1 < p < r - 1. 
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